library(tidyverse)
library(bayesrules)
library(bayesplot)
library(rstan)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(ggmosaic)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ✔ dplyr 1.1.2 ✔ readr 2.1.4 ✔ forcats 1.0.0 ✔ stringr 1.5.0 ✔ ggplot2 3.4.2 ✔ tibble 3.2.1 ✔ lubridate 1.9.2 ✔ tidyr 1.3.0 ✔ purrr 1.0.1 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors This is bayesplot version 1.10.0 - Online documentation and vignettes at mc-stan.org/bayesplot - bayesplot theme set to bayesplot::theme_default() * Does _not_ affect other ggplot2 plots * See ?bayesplot_theme_set for details on theme setting Loading required package: StanHeaders rstan (Version 2.21.8, GitRev: 2e1f913d3ca3) For execution on a local, multicore CPU with excess RAM we recommend calling options(mc.cores = parallel::detectCores()). To avoid recompilation of unchanged Stan programs, we recommend calling rstan_options(auto_write = TRUE) Attaching package: ‘rstan’ The following object is masked from ‘package:tidyr’: extract Loading required package: Rcpp This is rstanarm version 2.21.4 - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors! - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults. - For execution on a local, multicore CPU with excess RAM we recommend calling options(mc.cores = parallel::detectCores()) Attaching package: ‘rstanarm’ The following object is masked from ‘package:rstan’: loo
head( hotel_bookings )
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ⋯ | deposit_type | agent | company | days_in_waiting_list | customer_type | average_daily_rate | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <fct> | <fct> | <dbl> | <dbl> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <fct> | <fct> | <fct> | <dbl> | <fct> | <dbl> | <dbl> | <dbl> | <fct> | <date> |
| City Hotel | 1 | 1 | 2015 | September | 40 | 30 | 0 | 2 | 1 | ⋯ | Non Refund | 50 | NULL | 0 | Transient | 98.10 | 0 | 0 | Canceled | 2015-09-29 |
| Resort Hotel | 1 | 19 | 2016 | March | 12 | 19 | 2 | 4 | 2 | ⋯ | No Deposit | 240 | NULL | 0 | Transient | 70.17 | 0 | 1 | Canceled | 2016-03-02 |
| Resort Hotel | 0 | 9 | 2017 | August | 31 | 1 | 0 | 4 | 2 | ⋯ | No Deposit | 241 | NULL | 0 | Transient | 193.40 | 0 | 1 | Check-Out | 2017-08-05 |
| Resort Hotel | 0 | 110 | 2016 | November | 46 | 11 | 0 | 1 | 2 | ⋯ | No Deposit | 314 | NULL | 0 | Transient | 36.24 | 1 | 0 | Check-Out | 2016-11-12 |
| City Hotel | 0 | 329 | 2017 | July | 30 | 27 | 0 | 2 | 2 | ⋯ | No Deposit | 9 | NULL | 0 | Transient | 89.10 | 0 | 1 | Check-Out | 2017-07-29 |
| Resort Hotel | 0 | 212 | 2017 | August | 35 | 31 | 2 | 8 | 2 | ⋯ | No Deposit | 143 | NULL | 0 | Transient | 89.75 | 0 | 0 | Check-Out | 2017-09-10 |
mean( hotel_bookings$is_canceled == '1' )
37% of bookings are canceled. This is a lot!
Compute means per class:
hotel_bookings %>%
select( is_canceled, lead_time, previous_cancellations, is_repeated_guest, average_daily_rate ) %>%
group_by( is_canceled ) %>%
summarise_all( list(mean=mean, sd=sd) )
| is_canceled | lead_time_mean | previous_cancellations_mean | is_repeated_guest_mean | average_daily_rate_mean | lead_time_sd | previous_cancellations_sd | is_repeated_guest_sd | average_daily_rate_sd |
|---|---|---|---|---|---|---|---|---|
| <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 0 | 83.45899 | 0.007886435 | 0.03943218 | 97.86945 | 93.90104 | 0.1636886 | 0.1947745 | 46.75579 |
| 1 | 142.49180 | 0.267759563 | 0.01912568 | 106.99902 | 116.21010 | 1.9029606 | 0.1371542 | 43.31366 |
From a comparison of mean differences and standard deviations, no strong separations of density plots are expected.
Linear scale:
options(repr.plot.width=15, repr.plot.height=5)
plot <- ggplot( hotel_bookings, aes(x=lead_time, fill=is_canceled)) + geom_density( alpha=0.5 )
plot
Log scale:
plot + scale_x_log10()
Warning message: “Transformation introduced infinite values in continuous x-axis” Warning message: “Removed 44 rows containing non-finite values (`stat_density()`).”
Boxplot:
ggplot( hotel_bookings, aes(x=lead_time, y=is_canceled) ) + geom_boxplot()
It appears that the longer the room is booked in advanced, the larger the probability of cancelation. This makes sense, if one books only 1-2 weeks before, then one is typically sure that everything will work out.
Check distribution of previous cancellations:
hotel_bookings %>%
group_by( previous_cancellations ) %>%
summarise( n=n() )
| previous_cancellations | n |
|---|---|
| <dbl> | <int> |
| 0 | 949 |
| 1 | 48 |
| 4 | 1 |
| 25 | 1 |
| 26 | 1 |
Most of the guests did not cancel previously. Considering the actual number of previous cancellations might lead to severe overfitting. As an alternative, one could include a binary variable 'has_canceled_previously'.
Density plot:
ggplot( hotel_bookings, aes(x=previous_cancellations, fill=is_canceled)) +
geom_density( alpha=0.5 ) +
scale_x_log10()
Warning message: “Transformation introduced infinite values in continuous x-axis” Warning message: “Removed 949 rows containing non-finite values (`stat_density()`).”
This is a dangerous visualisation and it is clearly visible that the data are sparse.
Mosaic plot:
ggplot( data=hotel_bookings %>% mutate(has_canceled_previously=previous_cancellations>0) ) +
geom_mosaic( aes(x=product(is_canceled, has_canceled_previously), fill=is_canceled) )
Warning message: “`unite_()` was deprecated in tidyr 1.2.0. ℹ Please use `unite()` instead. ℹ The deprecated feature was likely used in the ggmosaic package. Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.”
It appears that a previous cancellation is an indicator for a cancellation now.
Table:
hotel_bookings %>%
group_by( is_canceled, previous_cancellations>0 ) %>%
summarise( n=n() )
`summarise()` has grouped output by 'is_canceled'. You can override using the `.groups` argument.
| is_canceled | previous_cancellations > 0 | n |
|---|---|---|
| <fct> | <lgl> | <int> |
| 0 | FALSE | 632 |
| 0 | TRUE | 2 |
| 1 | FALSE | 317 |
| 1 | TRUE | 49 |
Mosaic plot:
ggplot( data=hotel_bookings ) + geom_mosaic( aes(x=product(is_canceled, is_repeated_guest), fill=is_canceled) )
Table:
hotel_bookings %>%
group_by( is_canceled, is_repeated_guest ) %>%
summarise( n=n() )
`summarise()` has grouped output by 'is_canceled'. You can override using the `.groups` argument.
| is_canceled | is_repeated_guest | n |
|---|---|---|
| <fct> | <dbl> | <int> |
| 0 | 0 | 609 |
| 0 | 1 | 25 |
| 1 | 0 | 359 |
| 1 | 1 | 7 |
Of the few repeated guests, slightly less appear to cancel their reservation, however the numbers are too low for a clear answer. There is also the risk for overfitting to the small number here.
ggplot( hotel_bookings, aes(x=average_daily_rate, fill=is_canceled)) + geom_density( alpha=0.5 )
The differences appear to be small and probably not significant. At least there is no clear pattern that emerges.
Prior on intercept: Reasonable range for $\pi$ between 0.1 and 0.8, centered at 0.4. In terms of log-odds:
log(0.4/(1-0.4))
A probability of 0.1 corresponds to a log-odds of $\approx -2$ and 0.8 to a log-odds of $\approx 1.4$.
log(0.1/(1-0.1))
log(0.8/(1-0.8))
This can be more or less reproduced with a Gaussian prior centered around -0.4 with a standard deviation of 0.8:
plot_normal( mean=-0.4, sd=0.8 )
Resulting model:
$$Y_i|\beta_0,\beta_1,\beta_2,\beta_3,\beta_4 \sim \text{Bern}(\pi_i), \quad \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} $$$$\beta_0 \sim N(0.4,0.8)$$$$\beta_1 \sim N(0,\cdot)$$$$\beta_2 \sim N(0,\cdot)$$$$\beta_3 \sim N(0,\cdot)$$$$\beta_4 \sim N(0,\cdot)$$where $\cdot$ stands for autoscale=TRUE (weakly informed prior).
Since the outcome is of (binary) categorical nature, it is best modeled with a Bernoulli distribution. The parameter $\pi$ of the distribution needs to be modeled as a probablity in the interval [0,1], what is done best with a sigmoid function or equally by modeling the log-odds with a linear ansatz. Since $\beta_1$, .., $\beta_4$ are again parameters of a linear model, we can assume Gaussian priors as usual.
Simulate the model:
booking_model_1 <- stan_glm( is_canceled ~ lead_time + previous_cancellations + is_repeated_guest + average_daily_rate,
data = hotel_bookings, family = binomial,
prior_intercept = normal(0.4, 0.8),
prior = normal(0, 1, autoscale=TRUE),
chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1). Chain 1: Chain 1: Gradient evaluation took 2.5e-05 seconds Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds. Chain 1: Adjust your expectations accordingly! Chain 1: Chain 1: Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 1: Iteration: 10000 / 10000 [100%] (Sampling) Chain 1: Chain 1: Elapsed Time: 0.554266 seconds (Warm-up) Chain 1: 0.611307 seconds (Sampling) Chain 1: 1.16557 seconds (Total) Chain 1: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2). Chain 2: Chain 2: Gradient evaluation took 2.1e-05 seconds Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds. Chain 2: Adjust your expectations accordingly! Chain 2: Chain 2: Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 2: Iteration: 10000 / 10000 [100%] (Sampling) Chain 2: Chain 2: Elapsed Time: 0.582348 seconds (Warm-up) Chain 2: 0.607858 seconds (Sampling) Chain 2: 1.19021 seconds (Total) Chain 2: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3). Chain 3: Chain 3: Gradient evaluation took 2.2e-05 seconds Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds. Chain 3: Adjust your expectations accordingly! Chain 3: Chain 3: Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 3: Iteration: 10000 / 10000 [100%] (Sampling) Chain 3: Chain 3: Elapsed Time: 0.54983 seconds (Warm-up) Chain 3: 0.618881 seconds (Sampling) Chain 3: 1.16871 seconds (Total) Chain 3: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4). Chain 4: Chain 4: Gradient evaluation took 2e-05 seconds Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds. Chain 4: Adjust your expectations accordingly! Chain 4: Chain 4: Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 4: Iteration: 10000 / 10000 [100%] (Sampling) Chain 4: Chain 4: Elapsed Time: 0.549079 seconds (Warm-up) Chain 4: 0.596707 seconds (Sampling) Chain 4: 1.14579 seconds (Total) Chain 4:
Prior summary:
prior_summary( booking_model_1 )
Priors for model 'booking_model_1'
------
Intercept (after predictors centered)
~ normal(location = 0.4, scale = 0.8)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [1,1,1,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [0.0094,0.8588,5.6790,...])
------
See help('prior_summary.stanreg') for more details
Diagnostics:
mcmc_trace( booking_model_1 )
mcmc_dens_overlay( booking_model_1 )
mcmc_acf( booking_model_1 )
Warning message: “The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0. ℹ Please use the `rows` argument instead. ℹ The deprecated feature was likely used in the bayesplot package. Please report the issue at <https://github.com/stan-dev/bayesplot/issues/>.”
Looks good!
Posterior predictive check:
proportion_of_cancellations <- function(x){mean(x == 1)}
pp_check(booking_model_1, nreps = 100,
plotfun = "stat", stat = "proportion_of_cancellations") + xlab("probability of cancellations")
Warning message: “'nreps' is ignored for this PPC” `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Appears to be in an ok range.
tidy(booking_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | -1.909140726 | 0.206106095 | -2.178027271 | -1.645967782 |
| lead_time | 0.004824958 | 0.000689820 | 0.003950848 | 0.005728844 |
| previous_cancellations | 2.216640092 | 0.387012270 | 1.733796500 | 2.716642477 |
| is_repeated_guest | -0.745651735 | 0.544346703 | -1.469581416 | -0.080961138 |
| average_daily_rate | 0.007238742 | 0.001571894 | 0.005247197 | 0.009247278 |
$\beta_2$:
c( exp(1.73), exp(2.22), exp(2.72) )
With every previous cancellation, the odds of a cancellation increase by a factor between 5.6 and 15.2! (As stated before, I think this is very dangerous, I would prefer a field 'has_canceled_previously')
$\beta_3$:
c( exp(-1.47), exp(-0.75), exp(-0.08))
If the guest has stayed at the hotel before, the odds of cancellation appear to be between cut to one fourth or almost staying the same, what is quite a big uncertainty (also here I see a risk for overfitting to the small number).
95% confidence interval:
tidy(booking_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | -1.909140726 | 0.206106095 | -2.325342491 | -1.508698267 |
| lead_time | 0.004824958 | 0.000689820 | 0.003494805 | 0.006213864 |
| previous_cancellations | 2.216640092 | 0.387012270 | 1.493876016 | 3.010409575 |
| is_repeated_guest | -0.745651735 | 0.544346703 | -1.890735848 | 0.226827159 |
| average_daily_rate | 0.007238742 | 0.001571894 | 0.004203853 | 0.010290770 |
At the 95%-level it appears that lead_time, previous_cancellations and average daily rate are significantly related to the cancellation rate. The 95%-CI for is_repeated_guest encloses zero and is therefore not significant at this level. From the considerations in exercise 13.7 I have however my doubts how well 'previous_cancellations' and 'average_daily_rate' work in reality.
In-sample:
classification_summary(model = booking_model_1, data = hotel_bookings, cutoff = 0.5)
| y | 0 | 1 |
|---|---|---|
| <fct> | <dbl> | <dbl> |
| 0 | 574 | 60 |
| 1 | 250 | 116 |
| <dbl> | |
|---|---|
| sensitivity | 0.3169399 |
| specificity | 0.9053628 |
| overall_accuracy | 0.6900000 |
CV:
cv_accuracy <- classification_summary_cv(model = booking_model_1, data = hotel_bookings, cutoff = 0.5, k = 10)
cv_accuracy$cv
| sensitivity | specificity | overall_accuracy |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0.3019413 | 0.9048041 | 0.684 |
cv_accuracy$folds$overall_accuracy
Sensitivity, specificity and overall accuracy are very similar. In general this makes sense, because it is hard for a linear model to overfit to data. However I am surprised that the low numbers in previous_cancellations and is_repeated_guest did not have a bigger impact.
The model will correctly predict whether a room is canceled or not in 69% of the cases. Out of all the rooms that will be canceled, the model can only find 32% and out of all rooms that will not be canceled, it can find 91%.
find_sens_spec <- function( cutoff ) {
sm <- classification_summary(model = booking_model_1, data = hotel_bookings, cutoff = cutoff)
c( cutoff, sm$accuracy_rates[[1]] )
}
df <- data.frame( matrix(nrow=0, ncol=4) )
colnames(df) <- c("cutoff", "sensitivity", "specificity", "overall_accuracy")
for (c in seq(0.1,0.95,0.05)) {
df[nrow(df)+1,] = find_sens_spec( c )
}
df %>%
pivot_longer( sensitivity:overall_accuracy, names_to="metric" ) %>%
ggplot( aes(x=cutoff, y=value, color=metric) ) +
geom_line() +
geom_point() +
geom_hline( yintercept=0.75, linetype="dashed", color = "black" )
df %>% filter( abs(sensitivity-0.75) < 0.2 )
| cutoff | sensitivity | specificity | overall_accuracy |
|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> |
| 0.25 | 0.8907104 | 0.3359621 | 0.539 |
| 0.30 | 0.7841530 | 0.5047319 | 0.607 |
| 0.35 | 0.6256831 | 0.6545741 | 0.644 |
For a sensitivity close to 75%, i.e. capturing about 75% of all cancellations, a cutoff threshold of 0.3 needs to be set. This results in a specificity of 51%, i.e. half of all non-cancellations are predicted as cancellations (false positives).
Simulate:
booking_model_1_df <- data.frame(booking_model_1)
preds <- booking_model_1_df %>%
mutate( mu=X.Intercept. + lead_time*30 + is_repeated_guest*0 + previous_cancellations*1 + average_daily_rate*100 ) %>%
mutate( odds=exp(mu), prob=odds/(odds+1) ) %>%
mutate( Y=rbinom(nrow(booking_model_1_df), size=1, prob=prob) )
head( preds )
| X.Intercept. | lead_time | previous_cancellations | is_repeated_guest | average_daily_rate | mu | odds | prob | Y | |
|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | |
| 1 | -2.067350 | 0.004819027 | 2.652966 | -0.6439337 | 0.008666092 | 1.5967956 | 4.937186 | 0.8315700 | 0 |
| 2 | -2.004914 | 0.003755334 | 2.524631 | -0.5101437 | 0.008897862 | 1.5221634 | 4.582128 | 0.8208568 | 1 |
| 3 | -2.078681 | 0.004511512 | 2.720257 | -1.1924180 | 0.009052037 | 1.6821243 | 5.376966 | 0.8431856 | 1 |
| 4 | -2.211178 | 0.004633320 | 2.262709 | -0.6388802 | 0.009974415 | 1.1879714 | 3.280420 | 0.7663781 | 1 |
| 5 | -1.619227 | 0.004681103 | 1.912430 | -0.7522682 | 0.005492169 | 0.9828526 | 2.672068 | 0.7276739 | 1 |
| 6 | -1.690405 | 0.005637259 | 2.433125 | -0.7060644 | 0.004518225 | 1.3636604 | 3.910481 | 0.7963540 | 1 |
Distribution of probabilities:
ggplot( preds ) + geom_histogram( aes(x=prob) ) + geom_vline( xintercept=0.5, color="red", linetype="dashed" )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Distribution of model predictions:
ggplot( preds, aes(x = Y) ) + geom_bar( aes(y=..prop.., group=1))
Warning message:
“The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(prop)` instead.”
mean( preds$Y )
The model predicts a cancellation in 76% of all the time. In a bet I would place my money in favour of the guest cancelling.
This can for example be achieved by setting the number of previous cancellations higher. Then it does not really matter what the rest of the variables is:
preds2 <- booking_model_1_df %>%
mutate( mu=X.Intercept. + lead_time*30 + is_repeated_guest*0 + previous_cancellations*5 + average_daily_rate*250 ) %>%
mutate( odds=exp(mu), prob=odds/(odds+1) ) %>%
mutate( Y=rbinom(nrow(booking_model_1_df), size=1, prob=prob) )
mean(preds2$Y)
ggplot( preds2 ) + geom_histogram( aes(x=prob) ) + geom_vline( xintercept=0.5, color="red", linetype="dashed" )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I would advocate here for not treating previous_cancellations as a numeric variable.
robot_data <- pulse_of_the_nation %>% select( robots, transformers, books, age, income )
head( robot_data )
| robots | transformers | books | age | income |
|---|---|---|---|---|
| <fct> | <dbl> | <dbl> | <dbl> | <dbl> |
| Unlikely | 1 | 20 | 64 | 8 |
| Unlikely | 0 | 6 | 56 | 68 |
| Unlikely | 0 | 0 | 63 | 46 |
| Unlikely | 0 | 1 | 48 | 51 |
| Unlikely | 1 | 30 | 32 | 100 |
| Unlikely | 0 | 15 | 64 | 54 |
robot_data %>% group_by( robots ) %>% summarise( unlikely_prop=n()/nrow(robot_data) )
| robots | unlikely_prop |
|---|---|
| <fct> | <dbl> |
| Likely | 0.199 |
| Unlikely | 0.801 |
Around 80% believe that it is unlikely.
robot_data %>%
group_by( transformers ) %>%
summarise( count=n() )
| transformers | count |
|---|---|
| <dbl> | <int> |
| 0 | 459 |
| 1 | 157 |
| 2 | 155 |
| 3 | 111 |
| 4 | 67 |
| 5 | 51 |
ggplot( data=robot_data ) + geom_mosaic( aes(x=product(robots, transformers), fill=robots) )
It appears that watching 2 or more transformer movies very slightly increases the belief that robots will take over. This might not be significant.
ggplot( robot_data ) + geom_histogram( aes(x=books, fill=robots), alpha=0.5 )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It looks like people who read few books believe it is more likely that robots will take over the world, however this is might not be significant.
ggplot( robot_data ) + geom_density( aes(x=age, fill=robots), alpha=0.5 )
It looks like people above 70 believe it is more likely that robots will take over the world.
ggplot( robot_data ) + geom_density( aes(x=income, fill=robots), alpha=0.5 )
It appears that people with low income believe it is more likely that robots will take over the world.
Prior on intercept: Reasonable range for $\pi$ between 0.5 and 0.95, centered at 0.8. In terms of log-odds:
log(0.8/(1-0.8))
A probability of 0.5 corresponds to a log-odds of $0$ and 0.95 to a log-odds of $\approx 3$.
log(0.5/(1-0.5))
log(0.95/(1-0.95))
This can be more or less reproduced with a Gaussian prior centered around 1.4 with a standard deviation of 0.8:
plot_normal( mean=1.4, sd=0.8 )
Resulting model:
$$Y_i|\beta_0,\beta_1,\beta_2,\beta_3,\beta_4 \sim \text{Bern}(\pi_i), \quad \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} $$$$\beta_0 \sim N(0.4,0.8)$$$$\beta_1 \sim N(0,\cdot)$$$$\beta_2 \sim N(0,\cdot)$$$$\beta_3 \sim N(0,\cdot)$$$$\beta_4 \sim N(0,\cdot)$$where $\cdot$ stands for autoscale=TRUE (weakly informed prior).
Since the outcome (likely/unlikely) is of binary categorical nature, it is best modeled with a Bernoulli distribution. The parameter $\pi$ of the distribution needs to be modeled as a probablity in the interval [0,1], what is done best with a sigmoid function or equally by modeling the log-odds with a linear ansatz. Since $\beta_1$, .., $\beta_4$ are again parameters of a linear model, we can assume Gaussian priors as usual.
Simulate the model:
robot_model_1 <- stan_glm( robots ~ transformers + books + age + income,
data = robot_data, family = binomial,
prior_intercept = normal(1.4, 0.8),
prior = normal(0, 1, autoscale=TRUE),
chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1). Chain 1: Chain 1: Gradient evaluation took 2.4e-05 seconds Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds. Chain 1: Adjust your expectations accordingly! Chain 1: Chain 1: Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 1: Iteration: 10000 / 10000 [100%] (Sampling) Chain 1: Chain 1: Elapsed Time: 0.464295 seconds (Warm-up) Chain 1: 0.559222 seconds (Sampling) Chain 1: 1.02352 seconds (Total) Chain 1: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2). Chain 2: Chain 2: Gradient evaluation took 2.1e-05 seconds Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds. Chain 2: Adjust your expectations accordingly! Chain 2: Chain 2: Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 2: Iteration: 10000 / 10000 [100%] (Sampling) Chain 2: Chain 2: Elapsed Time: 0.482915 seconds (Warm-up) Chain 2: 0.519015 seconds (Sampling) Chain 2: 1.00193 seconds (Total) Chain 2: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3). Chain 3: Chain 3: Gradient evaluation took 2.9e-05 seconds Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.29 seconds. Chain 3: Adjust your expectations accordingly! Chain 3: Chain 3: Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 3: Iteration: 10000 / 10000 [100%] (Sampling) Chain 3: Chain 3: Elapsed Time: 0.470337 seconds (Warm-up) Chain 3: 0.527173 seconds (Sampling) Chain 3: 0.99751 seconds (Total) Chain 3: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4). Chain 4: Chain 4: Gradient evaluation took 2.2e-05 seconds Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds. Chain 4: Adjust your expectations accordingly! Chain 4: Chain 4: Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 4: Iteration: 10000 / 10000 [100%] (Sampling) Chain 4: Chain 4: Elapsed Time: 0.4765 seconds (Warm-up) Chain 4: 0.570284 seconds (Sampling) Chain 4: 1.04678 seconds (Total) Chain 4:
Prior summary:
prior_summary( robot_model_1 )
Priors for model 'robot_model_1'
------
Intercept (after predictors centered)
~ normal(location = 1.4, scale = 0.8)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [1,1,1,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [0.649,0.013,0.060,...])
------
See help('prior_summary.stanreg') for more details
Diagnostics:
mcmc_trace( robot_model_1 )
mcmc_dens_overlay( robot_model_1 )
mcmc_acf( robot_model_1 )
Looks good!
Posterior predictive check:
proportion_of_unbelievers <- function(x){mean(x == 1)}
pp_check(robot_model_1, nreps = 100,
plotfun = "stat", stat = "proportion_of_unbelievers") + xlab("probability of not believing")
Warning message: “'nreps' is ignored for this PPC” `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Appears to be in an ok range.
tidy(robot_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | 1.139334821 | 0.3480184360 | 0.692670549 | 1.582254e+00 |
| transformers | -0.121588725 | 0.0510392103 | -0.188425271 | -5.465837e-02 |
| books | -0.002080105 | 0.0009633804 | -0.003376087 | -8.505848e-04 |
| age | -0.006461826 | 0.0050434602 | -0.012965282 | -5.122908e-06 |
| income | 0.009800610 | 0.0019556609 | 0.007365586 | 1.237479e-02 |
$\beta_3$:
c( exp(-0.013), exp(-0.006), exp(-5.1e-6) )
Age has a very small impact. With every year more, around 0-1% more people believe that robots take over the world.
$\beta_4$:
c( exp(0.0074), exp(0.0098), exp(0.012) )
With every dollar of income, between 0-1% less people believe that robots take over the world.
95% confidence interval:
tidy(robot_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | 1.139334821 | 0.3480184360 | 0.453066782 | 1.8229782306 |
| transformers | -0.121588725 | 0.0510392103 | -0.223538160 | -0.0190189544 |
| books | -0.002080105 | 0.0009633804 | -0.004142485 | -0.0001981823 |
| age | -0.006461826 | 0.0050434602 | -0.016463324 | 0.0033468971 |
| income | 0.009800610 | 0.0019556609 | 0.006090984 | 0.0137529405 |
At the 95%-level it appears that transformers, book and income are significantly related to the belief that robots take over the world within the next 10 years. Age does not appear to be significantly related to this belief.
At the 99% confidence level, also transformers and books drop out, what is somehow expected:
tidy(robot_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.99)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | 1.139334821 | 0.3480184360 | 0.240898975 | 2.0484447280 |
| transformers | -0.121588725 | 0.0510392103 | -0.254841633 | 0.0140178974 |
| books | -0.002080105 | 0.0009633804 | -0.004902721 | 0.0003656827 |
| age | -0.006461826 | 0.0050434602 | -0.019859359 | 0.0064090354 |
| income | 0.009800610 | 0.0019556609 | 0.005002534 | 0.0149679852 |
I could somehow imagine that number of books and income somehow correlate with education level and that age is correlated with understanding of technology. Probably the number of transformer movies is confounded with another quantity, I don't see directly how it impacts belief/non-belief in that robots will take over the world.
ggplot( robot_data ) + geom_boxplot( aes(x=age, y=transformers, group=transformers))
It appears that the number of watched transformer movies is also correlated with age and then also income.
robot_data %>% summarize_all( function(x) {mean(is.na(x))} )
| robots | transformers | books | age | income |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 0 | 0 | 0 | 0 | 0 |
In-sample:
classification_summary(model = robot_model_1, data = robot_data, cutoff = 0.6)
| y | 0 | 1 |
|---|---|---|
| <fct> | <dbl> | <dbl> |
| Likely | 8 | 191 |
| Unlikely | 7 | 794 |
| <dbl> | |
|---|---|
| sensitivity | 0.99126092 |
| specificity | 0.04020101 |
| overall_accuracy | 0.80200000 |
CV (Use 5 folds - there appears to be a problem with the cutoff, leading to missing values):
cv_accuracy <- classification_summary_cv(model = robot_model_1, data = robot_data, cutoff = 0.6, k = 5)
cv_accuracy$cv
| sensitivity | specificity | overall_accuracy |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0.9913289 | 0.03682964 | 0.801 |
cv_accuracy$folds
| fold | sensitivity | specificity | overall_accuracy |
|---|---|---|---|
| <int> | <dbl> | <dbl> | <dbl> |
| 1 | 0.9935897 | 0.02272727 | 0.780 |
| 2 | 0.9939759 | 0.05882353 | 0.835 |
| 3 | 0.9936709 | 0.00000000 | 0.785 |
| 4 | 0.9818182 | 0.05714286 | 0.820 |
| 5 | 0.9935897 | 0.04545455 | 0.785 |
Interestingly, the sensitivity is extremely large (if somebody does not believe that robots take over the world we capture them with quite some confidence), however the specificity is quite low, practically zero. It appears that our model cannot really predict who believes that robots take over the world with the given cutoff.
Sensitivity, specificity and overall accuracy are more or less similar. In general this makes sense, because it is hard for a linear model to overfit to data.
The model practically predicts 'unlikely' for almost all data points. With the given class proportion of 0.8, this results in a default accuracy of 80%. Obviously, the sensitivity is almost 100% and the specificity pracically zero percent.
fake_news <- fake_news %>% select( type, title_has_excl, title_words, negative )
head( fake_news )
| type | title_has_excl | title_words | negative | |
|---|---|---|---|---|
| <fct> | <lgl> | <int> | <dbl> | |
| 1 | fake | FALSE | 17 | 8.47 |
| 2 | real | FALSE | 18 | 4.74 |
| 3 | fake | TRUE | 16 | 3.33 |
| 4 | real | FALSE | 11 | 6.09 |
| 5 | fake | FALSE | 9 | 2.66 |
| 6 | real | FALSE | 12 | 3.02 |
ggplot( data=fake_news ) + geom_mosaic( aes(x=product(type, title_has_excl), fill=type) )
There appears to be a significant relationship between exclamation marks and fake news.
ggplot( data=fake_news ) + geom_density( aes(x=title_words, fill=type), alpha=0.5 )
Fake news appear to have more words in the title.
ggplot( data=fake_news ) + geom_density( aes(x=negative, fill=type), alpha=0.5 )
Fake news convey more often words with a negative sentiment.
Expect around 40% of fake news, but could also reasonably be between 10-90%.
log(0.4/(1-0.4))
log(0.1/(1-0.1))
log(0.9/(1-0.9))
The uncertainty in prior odds is reflected with the following normal distribution:
plot_normal( mean=-0.4, sd=0.8 )
fake_news <- fake_news %>%
mutate( is_fake=as.numeric(type=="fake") ) %>%
select( -type )
fake_model_1 <- stan_glm( is_fake ~ title_has_excl + title_words + negative,
data = fake_news, family = binomial,
prior_intercept = normal(-0.4, 0.8),
prior = normal(0, 1, autoscale=TRUE),
chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1). Chain 1: Chain 1: Gradient evaluation took 1.3e-05 seconds Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds. Chain 1: Adjust your expectations accordingly! Chain 1: Chain 1: Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 1: Iteration: 10000 / 10000 [100%] (Sampling) Chain 1: Chain 1: Elapsed Time: 0.120301 seconds (Warm-up) Chain 1: 0.139981 seconds (Sampling) Chain 1: 0.260282 seconds (Total) Chain 1: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2). Chain 2: Chain 2: Gradient evaluation took 1.1e-05 seconds Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds. Chain 2: Adjust your expectations accordingly! Chain 2: Chain 2: Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 2: Iteration: 10000 / 10000 [100%] (Sampling) Chain 2: Chain 2: Elapsed Time: 0.123034 seconds (Warm-up) Chain 2: 0.146218 seconds (Sampling) Chain 2: 0.269252 seconds (Total) Chain 2: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3). Chain 3: Chain 3: Gradient evaluation took 1.1e-05 seconds Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds. Chain 3: Adjust your expectations accordingly! Chain 3: Chain 3: Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 3: Iteration: 10000 / 10000 [100%] (Sampling) Chain 3: Chain 3: Elapsed Time: 0.12657 seconds (Warm-up) Chain 3: 0.151761 seconds (Sampling) Chain 3: 0.278331 seconds (Total) Chain 3: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4). Chain 4: Chain 4: Gradient evaluation took 1.1e-05 seconds Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds. Chain 4: Adjust your expectations accordingly! Chain 4: Chain 4: Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 4: Iteration: 10000 / 10000 [100%] (Sampling) Chain 4: Chain 4: Elapsed Time: 0.131461 seconds (Warm-up) Chain 4: 0.143539 seconds (Sampling) Chain 4: 0.275 seconds (Total) Chain 4:
mcmc_trace( fake_model_1 )
mcmc_dens_overlay( fake_model_1 )
mcmc_acf( fake_model_1 )
proportion_of_fake <- function(x){mean(x == 1)}
pp_check(fake_model_1, nreps = 100,
plotfun = "stat", stat = "proportion_of_fake") + xlab("probability of news being fake")
Warning message: “'nreps' is ignored for this PPC” `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The posterior distribution was narrowed down quite a bit!
tidy(fake_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | -2.9075219 | 0.75951530 | -4.452530849 | -1.4716072 |
| title_has_exclTRUE | 2.4434688 | 0.75895334 | 1.107301202 | 4.2078433 |
| title_words | 0.1113220 | 0.05745054 | -0.002195581 | 0.2253234 |
| negative | 0.3178338 | 0.15404234 | 0.028119176 | 0.6258343 |
The number of words in the title appears not to be significant at the 95%-level.
c( exp(1.11), exp(2.44), exp(4.21) )
If an exclamation mark is present, the odds for the news being fake increase by a factor between 3 and 67, centered at 11. This is quite a big range, more data are needed here.
c( exp(-0.00022), exp(0.11), exp(0.23) )
For every additional word, the odds for fake news increase by a factor of 1-1.26.
c( exp(0.028), exp(0.32), exp(0.63) )
Each percent of words with negative sentiment increases the odds for fake news by a factor of 1.02-1.88 (2-88%).
Overall the ranges are quite wide because of the limited amount of available data.
In-sample:
classification_summary(model = fake_model_1, data = fake_news, cutoff = 0.5)
| y | 0 | 1 |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0 | 84 | 6 |
| 1 | 32 | 28 |
| <dbl> | |
|---|---|
| sensitivity | 0.4666667 |
| specificity | 0.9333333 |
| overall_accuracy | 0.7466667 |
CV:
cv_accuracy <- classification_summary_cv(model = fake_model_1, data = fake_news, cutoff = 0.5, k = 10)
cv_accuracy$cv
| sensitivity | specificity | overall_accuracy |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0.3400397 | 0.9386111 | 0.7 |
Sensitivity, specificity and accuracy vary quite a bit!
cv_accuracy$folds
| fold | sensitivity | specificity | overall_accuracy |
|---|---|---|---|
| <int> | <dbl> | <dbl> | <dbl> |
| 1 | 0.6666667 | 0.8888889 | 0.8000000 |
| 2 | 0.2000000 | 0.9000000 | 0.6666667 |
| 3 | 0.0000000 | 0.8888889 | 0.5333333 |
| 4 | 0.0000000 | 0.9166667 | 0.7333333 |
| 5 | 0.4444444 | 1.0000000 | 0.6666667 |
| 6 | 0.2857143 | 1.0000000 | 0.6666667 |
| 7 | 0.6666667 | 0.9166667 | 0.8666667 |
| 8 | 0.4285714 | 0.8750000 | 0.6666667 |
| 9 | 0.3750000 | 1.0000000 | 0.6666667 |
| 10 | 0.3333333 | 1.0000000 | 0.7333333 |
Using a smaller K (K=10 means only 15 data points for hold-out set, use rather K=5 with 30 points):
cv_accuracy2 <- classification_summary_cv(model = fake_model_1, data = fake_news, cutoff = 0.5, k = 5)
cv_accuracy2$cv
| sensitivity | specificity | overall_accuracy |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0.467094 | 0.9345472 | 0.7466667 |
cv_accuracy2$folds
| fold | sensitivity | specificity | overall_accuracy |
|---|---|---|---|
| <int> | <dbl> | <dbl> | <dbl> |
| 1 | 0.5833333 | 0.9444444 | 0.8000000 |
| 2 | 0.4444444 | 0.9047619 | 0.7666667 |
| 3 | 0.5384615 | 0.9411765 | 0.7666667 |
| 4 | 0.3076923 | 0.9411765 | 0.6666667 |
| 5 | 0.4615385 | 0.9411765 | 0.7333333 |
sensitivity, specificity and accuracy are varying less, but still quite a bit.
pulse_data <- pulse_of_the_nation %>%
select( ghosts, income, age, education, science_is_honest, trump_approval )
head( pulse_data )
| ghosts | income | age | education | science_is_honest | trump_approval |
|---|---|---|---|---|---|
| <fct> | <dbl> | <dbl> | <fct> | <fct> | <fct> |
| Yes | 8 | 64 | College degree | Strongly Agree | Strongly disapprove |
| No | 68 | 56 | High school | Somewhat Agree | Strongly disapprove |
| No | 46 | 63 | Some college | Somewhat Agree | Somewhat Approve |
| No | 51 | 48 | High school | Somewhat Disagree | Strongly Approve |
| Yes | 100 | 32 | Some college | Strongly Agree | Somewhat Approve |
| No | 54 | 64 | Some college | Strongly Agree | Strongly disapprove |
ggplot( pulse_data ) + geom_density( aes(x=income, fill=ghosts), alpha=0.5 )
no obvious correlations, mabye people with medium incomes believe less in ghosts.
ggplot( pulse_data ) + geom_density( aes(x=age, fill=ghosts), alpha=0.5 )
No obvious correlations, older people appear to believe even so slightly more in ghosts.
ggplot( pulse_data ) + geom_mosaic( aes(x=product(ghosts, education), fill=ghosts) )
There appears to be a slight correlation between believing in ghosts and education, however not as strong as suspected.
ggplot( pulse_data ) + geom_mosaic( aes(x=product(ghosts, science_is_honest), fill=ghosts) )
Very small differences, probably not significant.
ggplot( pulse_data ) + geom_mosaic( aes(x=product(ghosts, trump_approval), fill=ghosts) )
Also here probably not significant.
Expect around 60% who believe in ghosts, but could also reasonably be between 20-90%.
log(0.6/(1-0.6))
log(0.2/(1-0.2))
log(0.9/(1-0.9))
The uncertainty in prior odds is more or less reflected with the following normal distribution:
plot_normal( mean=0.4, sd=0.6 )
pulse_data <- pulse_data %>%
mutate( ghosts=as.numeric(ghosts=="Yes") )
pulse_model_1 <- stan_glm( ghosts ~ age + income + education + science_is_honest + trump_approval,
data = pulse_data, family = binomial,
prior_intercept = normal(0.4, 0.6),
prior = normal(0, 1, autoscale=TRUE),
chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1). Chain 1: Chain 1: Gradient evaluation took 3.3e-05 seconds Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.33 seconds. Chain 1: Adjust your expectations accordingly! Chain 1: Chain 1: Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 1: Iteration: 10000 / 10000 [100%] (Sampling) Chain 1: Chain 1: Elapsed Time: 2.80784 seconds (Warm-up) Chain 1: 2.69246 seconds (Sampling) Chain 1: 5.5003 seconds (Total) Chain 1: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2). Chain 2: Chain 2: Gradient evaluation took 3.2e-05 seconds Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.32 seconds. Chain 2: Adjust your expectations accordingly! Chain 2: Chain 2: Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 2: Iteration: 10000 / 10000 [100%] (Sampling) Chain 2: Chain 2: Elapsed Time: 2.85831 seconds (Warm-up) Chain 2: 2.94854 seconds (Sampling) Chain 2: 5.80686 seconds (Total) Chain 2: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3). Chain 3: Chain 3: Gradient evaluation took 2.2e-05 seconds Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds. Chain 3: Adjust your expectations accordingly! Chain 3: Chain 3: Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 3: Iteration: 10000 / 10000 [100%] (Sampling) Chain 3: Chain 3: Elapsed Time: 2.87614 seconds (Warm-up) Chain 3: 3.39404 seconds (Sampling) Chain 3: 6.27018 seconds (Total) Chain 3: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4). Chain 4: Chain 4: Gradient evaluation took 2.8e-05 seconds Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.28 seconds. Chain 4: Adjust your expectations accordingly! Chain 4: Chain 4: Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 4: Iteration: 10000 / 10000 [100%] (Sampling) Chain 4: Chain 4: Elapsed Time: 3.63366 seconds (Warm-up) Chain 4: 4.02295 seconds (Sampling) Chain 4: 7.6566 seconds (Total) Chain 4:
mcmc_trace( pulse_model_1 )
mcmc_dens_overlay( pulse_model_1 )
mcmc_acf( pulse_model_1 )
proportion_of_ghosts <- function(x){mean(x == 1)}
pp_check(pulse_model_1, nreps = 100,
plotfun = "stat", stat = "proportion_of_ghosts") + xlab("probability of believing in ghosts")
Warning message: “'nreps' is ignored for this PPC” `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
tidy(pulse_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | -1.7272156368 | 0.649998894 | -3.057031004 | -0.5112633730 |
| age | -0.0087975842 | 0.004071164 | -0.016826448 | -0.0008942501 |
| income | -0.0007426599 | 0.001290706 | -0.003374491 | 0.0017734899 |
| educationGraduate degree | -0.3036828995 | 0.198303966 | -0.688425521 | 0.0848380810 |
| educationHigh school | 0.3753208081 | 0.196722888 | -0.001248007 | 0.7708382558 |
| educationOther | 0.5684712488 | 0.500754317 | -0.445618556 | 1.5956771172 |
| educationSome college | 0.2669902207 | 0.174887687 | -0.072274375 | 0.6153956820 |
| science_is_honestSomewhat Agree | 1.1356768450 | 0.544904000 | 0.107309155 | 2.2914345614 |
| science_is_honestSomewhat Disagree | 0.8360796514 | 0.568171033 | -0.232795020 | 2.0257495431 |
| science_is_honestStrongly Agree | 0.9541872485 | 0.552602988 | -0.068699415 | 2.1235345838 |
| science_is_honestStrongly Disagree | 1.2913664396 | 0.573804169 | 0.217174697 | 2.4911675867 |
| trump_approvalSomewhat Approve | 0.5323032328 | 0.346767339 | -0.122920153 | 1.2204693874 |
| trump_approvalSomewhat disapprove | 0.4953353137 | 0.368243727 | -0.207054762 | 1.2346362902 |
| trump_approvalStrongly Approve | 0.5526476714 | 0.327415250 | -0.068107641 | 1.2005745420 |
| trump_approvalStrongly disapprove | 0.7249495652 | 0.311730683 | 0.128547244 | 1.3608364941 |
Significant at 95%-level: age, trump_approval: strongly disapprove, science_is_honest: somewhat agree, strongly disagree
Not significant: income, education, trump_approval: everything except strongly disapprove, science_is_honest: everything else
It remains to be investigated whether only some values of a discrete quantity are significant.
c( exp(-0.017), exp(-0.00088), exp(-0.00069) )
Even though the relation is significant, the effective appears to be small (around 0.1% more belief in ghosts for every year of living)
c( exp(0.13), exp(0.71), exp(1.34) )
If somebody strongly disapproves of Trump, the odds that they believe in ghosts are doubled! This seems indeed to be an interesting finding..
c( exp(0.11), exp(1.14), exp(2.29) )
c( exp(0.22), exp(1.29), exp(2.49) )
Strong factors! Hard to say whether this is actually true.
Only use significant predictors:
pulse_data2 <- pulse_data %>%
mutate( trump_disapprove=ifelse(trump_approval=="Strongly disapprove", 1, 0)) %>%
mutate( science_is_honest_somewhat_agree=ifelse(science_is_honest=="Somewhat Agree", 1, 0) ) %>%
mutate( science_is_honest_strongly_disagree=ifelse(science_is_honest=="Strongly Disagree", 1, 0) ) %>%
select( ghosts, age, trump_disapprove, science_is_honest_somewhat_agree, science_is_honest_strongly_disagree )
head( pulse_data2 )
| ghosts | age | trump_disapprove | science_is_honest_somewhat_agree | science_is_honest_strongly_disagree |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 64 | 1 | 0 | 0 |
| 0 | 56 | 1 | 1 | 0 |
| 0 | 63 | 0 | 1 | 0 |
| 0 | 48 | 0 | 0 | 0 |
| 1 | 32 | 0 | 0 | 0 |
| 0 | 64 | 1 | 0 | 0 |
pulse_model_2 <- stan_glm( ghosts ~ age + trump_disapprove + science_is_honest_somewhat_agree + science_is_honest_strongly_disagree,
data = pulse_data2, family = binomial,
prior_intercept = normal(0.4, 0.6),
prior = normal(0, 1, autoscale=TRUE),
chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1). Chain 1: Chain 1: Gradient evaluation took 2.5e-05 seconds Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds. Chain 1: Adjust your expectations accordingly! Chain 1: Chain 1: Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 1: Iteration: 10000 / 10000 [100%] (Sampling) Chain 1: Chain 1: Elapsed Time: 0.533343 seconds (Warm-up) Chain 1: 0.5704 seconds (Sampling) Chain 1: 1.10374 seconds (Total) Chain 1: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2). Chain 2: Chain 2: Gradient evaluation took 2.2e-05 seconds Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds. Chain 2: Adjust your expectations accordingly! Chain 2: Chain 2: Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 2: Iteration: 10000 / 10000 [100%] (Sampling) Chain 2: Chain 2: Elapsed Time: 0.526281 seconds (Warm-up) Chain 2: 0.585505 seconds (Sampling) Chain 2: 1.11179 seconds (Total) Chain 2: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3). Chain 3: Chain 3: Gradient evaluation took 2e-05 seconds Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds. Chain 3: Adjust your expectations accordingly! Chain 3: Chain 3: Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 3: Iteration: 10000 / 10000 [100%] (Sampling) Chain 3: Chain 3: Elapsed Time: 0.527449 seconds (Warm-up) Chain 3: 0.595125 seconds (Sampling) Chain 3: 1.12257 seconds (Total) Chain 3: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4). Chain 4: Chain 4: Gradient evaluation took 2.1e-05 seconds Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds. Chain 4: Adjust your expectations accordingly! Chain 4: Chain 4: Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 4: Iteration: 10000 / 10000 [100%] (Sampling) Chain 4: Chain 4: Elapsed Time: 0.525411 seconds (Warm-up) Chain 4: 0.588588 seconds (Sampling) Chain 4: 1.114 seconds (Total) Chain 4:
tidy(pulse_model_2, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | -0.327064681 | 0.232890583 | -0.78064462 | 0.129366642 |
| age | -0.008453545 | 0.003948532 | -0.01638031 | -0.000616473 |
| trump_disapprove | 0.225655382 | 0.137093533 | -0.03961780 | 0.488840621 |
| science_is_honest_somewhat_agree | 0.276340227 | 0.145920437 | -0.01127357 | 0.557672279 |
| science_is_honest_strongly_disagree | 0.494714718 | 0.223413631 | 0.06069460 | 0.930210683 |
Now only science is honest strongly disagree and age are significant..
Third and simplest model:
pulse_model_3 <- stan_glm( ghosts ~ age + science_is_honest_strongly_disagree,
data = pulse_data2, family = binomial,
prior_intercept = normal(0.4, 0.6),
prior = normal(0, 1, autoscale=TRUE),
chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1). Chain 1: Chain 1: Gradient evaluation took 3e-05 seconds Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.3 seconds. Chain 1: Adjust your expectations accordingly! Chain 1: Chain 1: Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 1: Iteration: 10000 / 10000 [100%] (Sampling) Chain 1: Chain 1: Elapsed Time: 0.429696 seconds (Warm-up) Chain 1: 0.501295 seconds (Sampling) Chain 1: 0.930991 seconds (Total) Chain 1: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2). Chain 2: Chain 2: Gradient evaluation took 2.4e-05 seconds Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds. Chain 2: Adjust your expectations accordingly! Chain 2: Chain 2: Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 2: Iteration: 10000 / 10000 [100%] (Sampling) Chain 2: Chain 2: Elapsed Time: 0.439278 seconds (Warm-up) Chain 2: 0.52289 seconds (Sampling) Chain 2: 0.962168 seconds (Total) Chain 2: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3). Chain 3: Chain 3: Gradient evaluation took 2.4e-05 seconds Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds. Chain 3: Adjust your expectations accordingly! Chain 3: Chain 3: Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 3: Iteration: 10000 / 10000 [100%] (Sampling) Chain 3: Chain 3: Elapsed Time: 0.430168 seconds (Warm-up) Chain 3: 0.52817 seconds (Sampling) Chain 3: 0.958338 seconds (Total) Chain 3: SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4). Chain 4: Chain 4: Gradient evaluation took 1.9e-05 seconds Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds. Chain 4: Adjust your expectations accordingly! Chain 4: Chain 4: Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup) Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup) Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup) Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup) Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup) Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup) Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling) Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling) Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling) Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling) Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling) Chain 4: Iteration: 10000 / 10000 [100%] (Sampling) Chain 4: Chain 4: Elapsed Time: 0.423899 seconds (Warm-up) Chain 4: 0.497535 seconds (Sampling) Chain 4: 0.921434 seconds (Total) Chain 4:
CV (use 5 folds for efficiency):
cv_accuracy1 <- classification_summary_cv(model = pulse_model_1, data = pulse_data, cutoff = 0.5, k = 5)
cv_accuracy1$cv
| sensitivity | specificity | overall_accuracy |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0.09386942 | 0.9055878 | 0.597 |
cv_accuracy2 <- classification_summary_cv(model = pulse_model_2, data = pulse_data2, cutoff = 0.5, k = 5)
cv_accuracy2$cv
| sensitivity | specificity | overall_accuracy |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0.0408025 | 0.9750452 | 0.62 |
cv_accuracy3 <- classification_summary_cv(model = pulse_model_3, data = pulse_data2, cutoff = 0.5, k = 5)
cv_accuracy3$cv
| sensitivity | specificity | overall_accuracy |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0.01606765 | 0.9857816 | 0.618 |
It appears to be difficult to compare the three models. While model 2 and 3 are very similar, model 1 appears to have a larger sensitivity at the cost of a lower specificity and at a similar accuracy.